https://github.com/PhilChodrow/data_science_intro

#### Intro stuff to be loaded in during part 1 ####
# setwd("~/data_science_intro/case_study")
# install.packages("tidyverse")
library(tidyverse)

# Read in listings data set, and convert price column to numeric
listings = read.csv("../data/listings.csv")
listings$price = as.numeric(gsub('\\$|,','',as.character(listings$price)))

# Read in the reviews data set, making sure to set stringsAsFactors=FALSE
reviews = read.csv("../data/reviews.csv", stringsAsFactors = FALSE)

Natural Language Processing

View the data from reviews.csv. What does each row represent?

head(reviews, 3)
##   listing_id      id       date reviewer_id reviewer_name
## 1    1178162 4724140 2013-05-21     4298113       Olivier
## 2    1178162 4869189 2013-05-29     6452964     Charlotte
## 3    1178162 5003196 2013-06-06     6449554     Sebastian
##                                                                                                                                                                                                                                                                                                                                                                                                                                                                 comments
## 1                                                                                                                                                                                                 My stay at islam's place was really cool! Good location, 5min away from subway, then 10min from downtown. The room was nice, all place was clean. Islam managed pretty well our arrival, even if it was last minute ;) i do recommand this place to any airbnb user :)
## 2                                                                                                                                                                                                                                                                                                                                    Great location for both airport and city - great amenities in the house: Plus Islam was always very helpful even though he was away
## 3 We really enjoyed our stay at Islams house. From the outside the house didn't look so inviting but the inside was very nice! Even though Islam himself was not there everything was prepared for our arrival. The airport T Station is only a 5-10 min walk away. The only little issue was that all the people in the house had to share one bathroom. But it was not really a problem and it worked out fine. We would recommend Islams place for a stay in Boston.

Display the top 10 most reviewed Airbnb listings using the reviews data frame. Are the counts consistent with the data in the listings data frame?

sort(listings$number_of_reviews, decreasing = TRUE)[1:10]
sort(table(reviews$listing_id), decreasing = TRUE)[1:10]

Later on, we will want to merge the two data frames: listings and reviews. The ID variables that we will use for this merge operation are listings$id and reviews$listing_id. It is important to understand the data structure before performing the analysis. The listings data frame contains 3585 unique listings in total, but both data frames cover the same 2829 listings with at least one review - let's confirm this.

length(unique(listings$id))
nrow(filter(listings, number_of_reviews>0))
length(unique(reviews$listing_id))

We will take review_scores_rating as the dependent variable that we are trying to predict. This is the average customer rating of the Airbnb listing, on a scale of 0-100. Plot a simple histogram of review_scores_rating, and count the number of non-NA values.

hist(listings$review_scores_rating)
sum(!is.na(listings$review_scores_rating))

Next, create a new data frame with just the review scores data from listings.csv. Filter out rows with review_scores_rating=NA.

listings_scores = listings %>% 
  filter(number_of_reviews > 0) %>%
  select("LISTING_ID"=id, "RATING"=review_scores_rating) %>%
  filter(!is.na(RATING))
str(listings_scores)
## 'data.frame':    2772 obs. of  2 variables:
##  $ LISTING_ID: int  3075044 6976 1436513 7651065 12386020 5706985 2843445 753446 849408 12023024 ...
##  $ RATING    : int  94 98 100 99 100 90 96 96 94 80 ...

Exercise 2.1: Writing a simple function in R
The syntax for writing the function f(x) = x^2 is

f <- function(x){
  return(x*x)
}

Write a function to convert the listing rating from a scale of 0-100 to one of the categories (“Terrible”,“Low”,“Mid”,“High”,“Perfect”).
Given an integer input rating from 0-100, the function should output:
“Perfect” if rating = 100
“High” if 95 <= rating <= 99
“Mid” if 90 <= rating <= 94
“Low” if 80 <= rating <= 89
“Terrible” if rating <= 79
For example: convert_rating(64) should output “Terrible”

Solution:

convert_rating <- function(rating){
  if(rating == 100){
    return("Perfect")
  }else if(rating >= 95){
    return("High")
  }else if(rating >= 90){
    return("Mid")
  }else if(rating >= 80){
    return("Low")
  }else{
    return("Terrible")
  }
}

Test a few values to make sure that the function works

convert_rating(100)
convert_rating(98)
convert_rating(90)
convert_rating(82)
convert_rating(46)

To apply the convert_rating() function to each element of a vector, we first need to “vectorize” the function. Avoid explicit for-loops in R whenever possible; they tend to be slow compared to vectorized operations.

v_convert_rating <- Vectorize(convert_rating, c("rating"))
# Test a few values to make sure that the function works.
v_convert_rating(c(100,32,87))
## [1] "Perfect"  "Terrible" "Low"

Compute the new column using a mutate call.

listings_scores <- listings_scores %>%
  mutate(RATING_TEXT = v_convert_rating(RATING))

# Take a look
table(listings_scores$RATING_TEXT)
## 
##     High      Low      Mid  Perfect Terrible 
##      754      545      670      628      175

These groupings look relatively well-balanced, which is desirable. For the NLP task, we will try to predict this rating category based upon the text data from reviews.csv.

Let’s go back to the reviews data frame.

str(reviews)
## 'data.frame':    68275 obs. of  6 variables:
##  $ listing_id   : int  1178162 1178162 1178162 1178162 1178162 1178162 1178162 1178162 1178162 1178162 ...
##  $ id           : int  4724140 4869189 5003196 5150351 5171140 5198929 6702817 6873023 7646702 8094418 ...
##  $ date         : chr  "2013-05-21" "2013-05-29" "2013-06-06" "2013-06-15" ...
##  $ reviewer_id  : int  4298113 6452964 6449554 2215611 6848427 6663826 8099222 7671888 8197342 9040491 ...
##  $ reviewer_name: chr  "Olivier" "Charlotte" "Sebastian" "Marine" ...
##  $ comments     : chr  "My stay at islam's place was really cool! Good location, 5min away from subway, then 10min from downtown. The r"| __truncated__ "Great location for both airport and city - great amenities in the house: Plus Islam was always very helpful eve"| __truncated__ "We really enjoyed our stay at Islams house. From the outside the house didn't look so inviting but the inside w"| __truncated__ "The room was nice and clean and so were the commodities. Very close to the airport metro station and located in"| __truncated__ ...

Currently, we have a data frame with 68275 rows - one per review. We would like a data frame with 2829 rows - one per listing, with all of that listing's reviews concatenated. We can use the group_by() and summarize() functions to transform the data frame in this way.

reviews_by_listing = reviews %>%
  select(listing_id,comments) %>%
  group_by(listing_id) %>%
  summarize(all_comments=paste(comments,collapse=" "))

# Check out the updated data frame - 2829 rows.
str(reviews_by_listing)
## Classes 'tbl_df', 'tbl' and 'data.frame':    2829 obs. of  2 variables:
##  $ listing_id  : int  3353 5506 6695 6976 8792 9273 9765 9824 9855 9857 ...
##  $ all_comments: chr  "Very friendly and helpful. Convenient location. The location is great as it's right next to the Green T stop an"| __truncated__ "Terry's Hotel Alterntv in Boston was a perfect place to stay for myself and my partner.  We mixed our trip with"| __truncated__ "Terry's apartment is beautifully decorated and feels like a home away from home. Kudos especially to the kitche"| __truncated__ "A Wonderful, pleasant, and charming host.  The bed is very comfortable and the room is nice. Travel wise one is"| __truncated__ ...

View the first listing’s comments.

reviews_by_listing$all_comments[1]

Observations? What are some problems that we might run into with bag-of-words?
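One concrete problem: bag-of-words discards word order, so negation scope is lost and opposite-meaning reviews can yield identical word counts. A toy illustration (the sentences below are made up, not drawn from the data):

```r
# Bag-of-words keeps only word counts, not word order:
# these two opposite-meaning "reviews" produce identical bags
bow = function(text) table(strsplit(tolower(text), " ")[[1]])
a = bow("the room was clean not noisy")
b = bow("the room was noisy not clean")
identical(a, b)
## [1] TRUE
```

Stemming and stop-word removal (below) shrink the vocabulary, but they do not recover word order; n-grams or sequence models would be needed for that.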

Natural Language Processing slides

Now, we are ready to construct the Bag-of-Words with Airbnb reviews for the prediction task. The following step-by-step procedure for building the Bag-of-Words is adapted from MIT EdX - Analytics Edge, Lecture 8.

Step 0: Install and load two packages for pre-processing:

# install.packages("tm")
library(tm)
# install.packages("SnowballC")
library(SnowballC)

Step 1: Convert the reviewer comments to a corpus.
Step 2: Change all the text to lower case (wrapping tolower in content_transformer() keeps the result a valid tm corpus).
Step 3: Remove all punctuation.
Step 4: Remove stop words (this step may take a minute).
Step 5: Stem the documents.

corpus = VCorpus(VectorSource(reviews_by_listing$all_comments)) %>%
  tm_map(content_transformer(tolower)) %>%
  tm_map(removePunctuation) %>%
  tm_map(removeWords, stopwords("english")) %>%
  tm_map(stemDocument)

# Take a look
corpus[[1]]$content
## [1] "friend help conveni locat locat great right next green t stop mani food place nearbi room small fit queen mattress almost noth els day use room night work host friend help giuesepp cordial friend nice other apart made avail meet late night check room good valu money met criteria central safe clean privat look recommend without reserv good host well roommat special thank mile hous nice locat block green line t station bus stop sever restaur shop food store neighborbood room somehow awkward great potenti improv anyhow certain worthwhil apart right middl everyth center allston roommat top qualiti laundri free cant beat giusepp nice host place cozi conveni two block away t rest roommat friend well room bit odd oldish pretti clean quiet overal enjoy stay 12 sidenot arriv late day due unforeseen travel circumst agre refund 1st night wait patient rather late midnightish arriv great stay new graduat student boston univers room ideal locat close bu easi access t travel elsewher town room clean bed comfi apart quiet exact look will stay futur comfort everyon apart ccol giusepp super friend help locat great stay couldnt easier giusepp roommat great nice apart quit area short walk t bed quit comfort giusepp nice guy help carri languag first arriv usual home work lot three roommat one student bu man tast giusepp nice cordial host room bit small look privat afort place sleep worthwhil small cramp room unfriend host care guest go get key apart breakfast even coffe rent room giusepp great host made feel comfort right begin locat great right t station market restaur just short walk away great locat room also comfort privat toilet sink clean maintain apart quiet surround neighborhood safe hesit recommend place stay guiseep nice respons inquiri place ideal locat room small accur advertis your look place just crash plan nice cheap option 2 housem met nice respect rent car need park 2 block street overnight 8am morn need start put money meter move car small inconveni big deal 
anyhow place safe secur giusepp met give key place appreci stay exceed expect great locat clean comfort abl reach giusepp anytim question wifi password instanc room rather small price right high recommend stay 5 min walk green line t stop place easi get seem pretti safe area room small advertis perfect singl travel budget giusepp realli friend respons host great stay nice room small clear ad clean confort bed locat just perfect apart quiet street close shop tube green line easi get key host guy quiet friend loud nois certain stay need place boston giusepp great host help respons cool guy room just look small cozi relat quiet fast wifi nice sink room well stay almost month definit stay say anyth els like talk hostgiusepp one damn help guy respons alway just call away need help abroad first trip boston hard face difficulti thank realli help advic giusepp regard place transport restaur etc talk room pretti much read descript privat toiletnot bath sink bed comfort got clean linen blanket dont get furnitur apart bed like tabl chair said earlier pretti small room room mate quiet usual busi dont surpris hous time talk locat numer restaur just coupl min walk home brighton ave lot place eat drink nearbi far transport concern nearest bus stop just 5 minut away walk student go longwood medic area everyday difficulti whatsoev room locat great zero issu 1 month stay recommend look cheap accommod safe happen locat room made small closet storag bed bedhead sink mirror share space general good shape reason spacious perhap weak lack vacuum meant carpet live room becam surpris stuf particl host effici took pain good oper often help provid empti metro card unfortun prove imposs reach afterward frustrat seem left someth valuabl repeat mail gone unansw good valu money caveat realli enjoy 1 month stay properti although room small specifi descript enough storag space short stay bed comfort sink room conveni loung area kitchen nice space hang lot natur light locat great train station 
supermarket barsrestaur amaz asian food court just minut walk away giusepp hous mate realli friend help giusepp great profession host welcom guest warm care everyth describ bed quit comfort room small enough space luggag person belong taken luggag two drawer room well water sink can wash hand room desk chair roombut shown photo room alreadi live area kitchen larg comfort everyth typic will find area readi use bathroom typic one overal satisfi stay will come futur realli enjoy stay room nice big bed comfort didnt see giusepp much alway email away super help recommend anyon definet stay although physic present apart guisepp alway avail phone email alway quick answer request deal issu pertain run apart descript room quit accur yes room quit small fine plan use room sleep found inclus sink room quit help particular share singl bathroom 4 other apart good access groceri store public transport meant easi neighbourhood live apart half hour t trip heart boston half hour bus trip cambridg peak hour good budget option visit either locat thorough enjoy stay place apart clean spacious nice common space cook eat chill bedroom small comfort definit suitabl one person place locat within close walk distanc two supermarket various shop t bus station also conveni locat reachabl within 5 minut giusepp fun respons host eager make stay pleasant smooth possibl can without hesit recommend airbnb good place recommend stay giusepp share flat two month room exact describ felt welcom like live share flat home everyon task can use whole flat giusepp friend answer everi question time stay allston great room cozi even small hous general comfort flat close t station green line b perfect reach downtown everi connect giusepp good host alway care avail wasnt alway joana one roommat perfect substitut realli care need content giusepp excel host avail hospit hes real deal everyth went smooth great quiet space great locat access public transport kind restaur imagin giusepp es una persona muy atenta que 
respond de forma rápida y cordial cualquier cuestión que pueda tener como dice en el anuncio la habitación es bastant pequeña tien un lavabo una cama y un armario la cama es cómoda en cualquier caso la casa es antigua aunqu está bien conservada y pued parec hasta misteriosa por ejemplo tien una escalera trasera para bajar al sótano la casa está relativament limpia aunqu al ser vario compañero todo tenemo que hacer nuestra part para que siga así la zona es agrad la casa está cerca de commonwealth ave una call llena de tienda restaurant etc y por la que pasa la línea verd de metro también es una zona con mucha vida ya que está al lado de boston univers y hay mucho estudiant universitario el barrio suel ser tranquilo aunqu vece los estudiant que viven cerca o en el mismo bloqu montan fiesta por lo que en ocasion pued haber bastant ambient por la noch la cocina es amplia y está equipada con todo lo necesario para cocinar y aunqu dispon de salón tien un pequeño hall con un sofá bastant cómodo tanto la cocina como el hall son lugar adecuado para trabajar al tener la habitación un escritorio great locat great host clean hous descript accur everyth expect nice neighborhood basic fill colleg student hous kelt clean nice quiet high recommend giusepp alway need someth also great compani felt like right home thing didnt like lack park boston guest mani time life feel strong peopl make stay anyth place advertis room small minimaland allston isnt upscal part boston none matter much taken instant treat regular old housem quick felt home even though far remov actual home youll well your solo travel prefer blend local student young profession without much fuss"
# Take a look at tm's stopwords:
stopwords("english")[1:100]
##   [1] "i"          "me"         "my"         "myself"     "we"        
##   [6] "our"        "ours"       "ourselves"  "you"        "your"      
##  [11] "yours"      "yourself"   "yourselves" "he"         "him"       
##  [16] "his"        "himself"    "she"        "her"        "hers"      
##  [21] "herself"    "it"         "its"        "itself"     "they"      
##  [26] "them"       "their"      "theirs"     "themselves" "what"      
##  [31] "which"      "who"        "whom"       "this"       "that"      
##  [36] "these"      "those"      "am"         "is"         "are"       
##  [41] "was"        "were"       "be"         "been"       "being"     
##  [46] "have"       "has"        "had"        "having"     "do"        
##  [51] "does"       "did"        "doing"      "would"      "should"    
##  [56] "could"      "ought"      "i'm"        "you're"     "he's"      
##  [61] "she's"      "it's"       "we're"      "they're"    "i've"      
##  [66] "you've"     "we've"      "they've"    "i'd"        "you'd"     
##  [71] "he'd"       "she'd"      "we'd"       "they'd"     "i'll"      
##  [76] "you'll"     "he'll"      "she'll"     "we'll"      "they'll"   
##  [81] "isn't"      "aren't"     "wasn't"     "weren't"    "hasn't"    
##  [86] "haven't"    "hadn't"     "doesn't"    "don't"      "didn't"    
##  [91] "won't"      "wouldn't"   "shan't"     "shouldn't"  "can't"     
##  [96] "cannot"     "couldn't"   "mustn't"    "let's"      "that's"

Step 6: Create a word count matrix (rows are listings' combined reviews, columns are words).

frequencies = DocumentTermMatrix(corpus)

# Take a look
frequencies
## <<DocumentTermMatrix (documents: 2829, terms: 43667)>>
## Non-/sparse entries: 833265/122700678
## Sparsity           : 99%
## Maximal term length: 365
## Weighting          : term frequency (tf)

Step 7: Account for sparsity.

# Use findFreqTerms to get a feeling for which words appear the most.
# Words that appear at least 10000 times:
findFreqTerms(frequencies, lowfreq=10000)
##  [1] "also"         "apart"        "boston"       "clean"       
##  [5] "close"        "comfort"      "definit"      "easi"        
##  [9] "everyth"      "get"          "good"         "great"       
## [13] "help"         "home"         "host"         "hous"        
## [17] "locat"        "love"         "need"         "neighborhood"
## [21] "nice"         "perfect"      "place"        "realli"      
## [25] "recommend"    "room"         "stay"         "time"        
## [29] "walk"         "well"
# Not all 43667 terms will be useful to us. Might as well get rid of some of them - why?
# Solution: only keep terms that appear in x% or more of the reviews
# 5% or more (142 or more)
sparse = removeSparseTerms(frequencies, 0.95)

# How many did we keep? (1136 terms, compared to 43667 previously)
sparse
## <<DocumentTermMatrix (documents: 2829, terms: 1136)>>
## Non-/sparse entries: 598695/2615049
## Sparsity           : 81%
## Maximal term length: 13
## Weighting          : term frequency (tf)
# colnames(sparse)

Step 8: Create data frame.

commentsTM = as.data.frame(as.matrix(sparse))

# View the data frame (rows are listings' combined reviews, columns are words)
str(commentsTM, list.len=10)
## 'data.frame':    2829 obs. of  1136 variables:
##  $ 100          : num  0 0 0 0 1 0 1 0 0 0 ...
##  $ 1015         : num  0 0 0 0 0 1 0 0 0 0 ...
##  $ 1520         : num  0 2 0 0 0 0 0 0 0 0 ...
##  $ 2nd          : num  0 0 0 0 0 0 0 0 0 4 ...
##  $ 3rd          : num  0 2 0 0 0 0 0 2 0 0 ...
##  $ 510          : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ abl          : num  1 1 2 3 0 3 0 2 0 0 ...
##  $ absolut      : num  0 0 2 2 0 2 0 2 0 1 ...
##  $ accept       : num  0 0 0 0 0 0 0 1 0 0 ...
##  $ access       : num  3 5 8 6 0 2 1 0 0 0 ...
##   [list output truncated]
# Drop columns that include numbers
commentsTM = commentsTM[,!grepl("[0-9]",names(commentsTM))]

We have finished building the term frequency data frame commentsTM. Next, we need to merge the two data frames commentsTM (features) and listings_scores (labels) before we can run our machine learning algorithms on this data. This will be a full-join by LISTING_ID.

# Add our dependent variable:
commentsTM$LISTING_ID = reviews_by_listing$listing_id
commentsTM = full_join(listings_scores, commentsTM)
## Joining, by = "LISTING_ID"
# Remove all rows with NA's
commentsTM = na.omit(commentsTM)

# View the first few data frame columns
# Note: Column names corresponding to word frequencies are lowercase,
# while all other column names are uppercase.
names(commentsTM)[1:10]
##  [1] "LISTING_ID"  "RATING"      "RATING_TEXT" "abl"         "absolut"    
##  [6] "accept"      "access"      "accommod"    "accomod"     "accueil"

Exercise 2.2: Your own Bag-of-Words
Following steps 0-8, build a Bag-of-Words data frame on the listings description data. Add price as the dependent variable, name it “PRICE”, remove the rows with price=NA, and move this column to the front of the new data frame. (Hint: Construct the term-matrix listingsTM by modifying the NLP code in the file bonus.R.)
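Before attempting the exercise, it may help to see the core of the pipeline in miniature. The sketch below hand-rolls lowercasing, punctuation removal, tokenization, and word counting in base R on a made-up two-element description vector (descs is hypothetical; the exercise itself should reuse the tm pipeline above):

```r
# Hand-rolled miniature of a document-term matrix on toy data
descs  = c("Sunny room near the park!",
           "Cozy sunny apartment near downtown")
tokens = strsplit(tolower(gsub("[[:punct:]]", "", descs)), "\\s+")
vocab  = sort(unique(unlist(tokens)))
# One row per document, one column per word
dtm    = t(sapply(tokens, function(w) table(factor(w, levels = vocab))))
dtm
```

The tm package adds stop-word removal, stemming, and sparsity handling on top of this same row-per-document, column-per-word structure.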

Up to here, we have just pre-processed and prepared our data. Now, we are ready to build models.

Building a CART model using Bag-of-Words

Next, we will use our Bag-of-Words to build a CART model to predict review scores. How could a model like this be useful in practice? We will follow the same cross-validation procedure as before to select the cp parameter for our CART model. The only difference is that now our features will be word counts, and our predictions will be the discrete values (“Terrible”,“Low”,“Mid”,“High”,“Perfect”).

To begin, convert RATING_TEXT to a factor variable, and set the order of the level values so that they appear properly in our confusion matrix.

commentsTM$RATING_TEXT = commentsTM$RATING_TEXT %>%
  as.factor() %>%
  ordered(levels=c("Terrible","Low","Mid","High","Perfect"))
str(commentsTM$RATING_TEXT)
##  Ord.factor w/ 5 levels "Terrible"<"Low"<..: 3 4 5 4 5 3 4 4 3 2 ...

Split data into training and testing sets

# install.packages("caTools")
library(caTools)
set.seed(123)
spl = sample.split(commentsTM$RATING_TEXT, SplitRatio = 0.7)
commentsTrain = subset(commentsTM, spl==TRUE)
commentsTest = subset(commentsTM, spl==FALSE)

Let’s use CART! Why CART?

# install.packages("rpart")
library(rpart)
# install.packages("rpart.plot")
library(rpart.plot)

First, train the model using the default parameter values (cp=0.01). Of course, we cannot include RATING or LISTING_ID as predictive variables - why not?

commentsCART = rpart(RATING_TEXT ~ . - RATING - LISTING_ID,
                     data=commentsTrain,
                     method="class")
# Display decision tree.  Does it make intuitive sense?
prp(commentsCART)

Next, let’s perform cross-validation on our Bag-of-Words CART model to tune our choice for cp. Useful resource for cross-validation of cp in rpart: https://cran.r-project.org/web/packages/rpart/vignettes/longintro.pdf

Step 1: Begin by constructing a tree with a small cp parameter

set.seed(2222)
commentsCART = rpart(RATING_TEXT ~ . - RATING - LISTING_ID, 
                     data=commentsTrain,
                     cp=0.001,
                     method="class")

Step 2: View the cross-validated error vs. cp

# In the `printcp()` table:
# "nsplit"    = number of splits in tree
# "rel error" = scaled training error
# "xerror"    = scaled cross-validation error
# "xstd"      = standard deviation of xerror
printcp(commentsCART)
## 
## Classification tree:
## rpart(formula = RATING_TEXT ~ . - RATING - LISTING_ID, data = commentsTrain, 
##     method = "class", cp = 0.001)
## 
## Variables actually used in tree construction:
##  [1] access    accommod  accur     advertis  afford    air       although 
##  [8] amaz      anoth     area      away      bathroom  beauti    bed      
## [15] believ    boston    breez     central   clean     close     code     
## [22] comfort   concern   deck      definit   descript  dirti     enjoy    
## [29] enough    ever      everi     everyth   first     four      futur    
## [36] get       good      great     help      high      home      host     
## [43] howev     immacul   late      les       list      live      locat    
## [50] never     offer     one       open      per       price     quaint   
## [57] receiv    recommend respons   room      safe      said      second   
## [64] smaller   smell     space     stay      super     sure      though   
## [71] took      towel     transit   tri       une       visit     weekend  
## [78] wonder    work     
## 
## Root node error: 1413/1941 = 0.72798
## 
## n= 1941 
## 
##           CP nsplit rel error  xerror     xstd
## 1  0.2016985      0   1.00000 1.00000 0.013875
## 2  0.0368011      1   0.79830 0.81246 0.015327
## 3  0.0110875      2   0.76150 0.77707 0.015455
## 4  0.0084926      5   0.72824 0.77778 0.015453
## 5  0.0077849      6   0.71975 0.76999 0.015475
## 6  0.0070771     11   0.68082 0.76787 0.015481
## 7  0.0068412     12   0.67374 0.76575 0.015487
## 8  0.0049540     15   0.65322 0.75867 0.015504
## 9  0.0046001     17   0.64331 0.76433 0.015490
## 10 0.0042463     20   0.62916 0.76433 0.015490
## 11 0.0038924     25   0.60722 0.76787 0.015481
## 12 0.0035386     29   0.59165 0.77424 0.015463
## 13 0.0030078     38   0.55909 0.77919 0.015448
## 14 0.0028309     42   0.54706 0.77778 0.015453
## 15 0.0025950     62   0.47841 0.77565 0.015459
## 16 0.0024770     65   0.47063 0.77636 0.015457
## 17 0.0021231     67   0.46568 0.78910 0.015416
## 18 0.0017693     78   0.44232 0.78769 0.015421
## 19 0.0016513     84   0.43171 0.78981 0.015414
## 20 0.0014154     87   0.42675 0.80113 0.015372
## 21 0.0010000     99   0.40977 0.80042 0.015375
# In the `plotcp()` plot:
# size of tree = (number of splits in tree) + 1
# dashed line occurs at 1 std. dev. above the minimum xerror
# Rule of Thumb: select the smallest model size which first
#                goes below the dashed line
plotcp(commentsCART)
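The same rule of thumb can be applied programmatically. The real table lives in commentsCART$cptable; the sketch below uses a made-up miniature table so the arithmetic is easy to follow:

```r
# Made-up miniature cp table (columns as in printcp() output)
cpt = data.frame(
  CP     = c(0.200, 0.037, 0.011, 0.005, 0.001),
  nsplit = c(0, 1, 2, 15, 99),
  xerror = c(1.000, 0.810, 0.770, 0.760, 0.800),
  xstd   = c(0.014, 0.015, 0.015, 0.016, 0.015)
)
# Threshold: 1 std. dev. above the minimum cross-validated error
thresh = min(cpt$xerror) + cpt$xstd[which.min(cpt$xerror)]
# Pick the smallest tree whose xerror falls below that threshold
cpt$CP[which(cpt$xerror < thresh)[1]]
## [1] 0.011
```

With these toy numbers, thresh = 0.76 + 0.016 = 0.776, and the rule selects cp = 0.011 (a 2-split tree) rather than the larger tree that minimizes xerror.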

Step 3: Prune the tree, and take a look

commentsCART = prune(commentsCART,cp=0.007)
prp(commentsCART)

Step 4: Evaluate the model's in-sample and out-of-sample accuracy using a confusion matrix (since this is a classification problem).

# CART on training set:
PredictCARTTrain = predict(commentsCART, type="class")
confusionMatrixTrain = table(commentsTrain$RATING_TEXT, PredictCARTTrain)
confusionMatrixTrain
##           PredictCARTTrain
##            Terrible Low Mid High Perfect
##   Terrible       37  37   7   10      31
##   Low            11 180  44   51      96
##   Mid             4 129  92  166      78
##   High            1  72  34  392      29
##   Perfect         2  33  29   88     288
# Accuracy?
sum(diag(confusionMatrixTrain))/nrow(commentsTrain)
## [1] 0.5095312
# Predictions on test set
PredictCART = predict(commentsCART, newdata=commentsTest, type="class")
confusionMatrix = table(commentsTest$RATING_TEXT, PredictCART)
confusionMatrix
##           PredictCART
##            Terrible Low Mid High Perfect
##   Terrible        4  18   6    2      23
##   Low             5  73  18   29      38
##   Mid             3  62  32   75      29
##   High            1  38  22  149      16
##   Perfect         0  10  18   36     124
# Accuracy?
sum(diag(confusionMatrix))/nrow(commentsTest)
## [1] 0.4596871

Question: How much worse would we have done if we didn’t use cross-validation, and just stuck with the default cp value (0.01)?

Step 5: Compare model to the baseline.

# The most frequent response in the training set is "High".
# A baseline that always predicts "High" has test-set accuracy 0.2720:
table(commentsTest$RATING_TEXT)["High"]/nrow(commentsTest)
##      High 
## 0.2719615

Can we improve the accuracy of our model in any way? Let’s try adding a few more features from the listings data frame.

str(listings)
more_features = listings %>%
  select(LISTING_ID=id, SUPER_HOST=host_is_superhost,
         RESPONSE_TIME=host_response_time,
         PRICE=price)
commentsTM = full_join(more_features,commentsTM)
str(commentsTM,list.len=10)

Rerun the CART model with the following code, and check the out-of-sample performance. Does it improve? Why or why not?

set.seed(123)
spl = sample.split(commentsTM$RATING_TEXT, SplitRatio = 0.7)
commentsTrain = subset(commentsTM, spl==TRUE)
commentsTest = subset(commentsTM, spl==FALSE)
commentsCART = rpart(RATING_TEXT ~ . - RATING - LISTING_ID,
                     data=commentsTrain,
                     method="class")
prp(commentsCART)

# CART on training set
PredictCARTTrain = predict(commentsCART, type="class")
confusionMatrixTrain = table(commentsTrain$RATING_TEXT, PredictCARTTrain)
# Accuracy?
sum(diag(confusionMatrixTrain))/nrow(commentsTrain)

# Predictions on test set
PredictCART = predict(commentsCART, newdata=commentsTest, type="class")
confusionMatrix = table(commentsTest$RATING_TEXT, PredictCART)
# Accuracy?
sum(diag(confusionMatrix))/nrow(commentsTest)

Exercise 2.3: Bag-of-Words + LASSO
Using the Bag-of-Words constructed in Exercise 2.2, build a LASSO model to predict price based upon listing descriptions only. (Hint: To build the LASSO model, follow the instructions in part 1 of this module.)
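For orientation, a minimal LASSO sketch on synthetic data is shown below. It assumes the glmnet package (commonly used for LASSO in R); in the exercise, x would be the word-count matrix from your Bag-of-Words and y the PRICE column:

```r
# Minimal LASSO sketch on synthetic data (assumes the glmnet package)
library(glmnet)
set.seed(1)
x = matrix(rnorm(100 * 20), nrow = 100)   # 100 "listings", 20 "word" features
y = 3 * x[, 1] - 2 * x[, 2] + rnorm(100)  # "price" driven by two features
cv_fit = cv.glmnet(x, y, alpha = 1)       # alpha = 1 gives the LASSO penalty
coef(cv_fit, s = "lambda.min")            # most coefficients shrink to zero
```

cv.glmnet() cross-validates the penalty strength lambda for you, much as we cross-validated cp for CART above.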

k-means Clustering slides

Unsupervised Learning

Thus far, our machine learning task has been to predict labels, which were either continuous-valued (for regression) or discrete-valued (for classification). To do this, we input to the ML algorithms some known (feature, label) examples (the training set), and the ML algorithm outputs a function which enables us to make predictions for some unknown (feature, ?) examples (the testing set). This problem setup is known as Supervised Learning.

Next, we consider Unsupervised Learning, where we are not given labelled examples, and we simply run ML algorithms on (feature) data, with the purpose of finding interesting structure and patterns in the data. Let’s run one of the widely-used unsupervised learning algorithms, k-means clustering, on the listings data frame to explore the Airbnb data set.

# First, let's look at the help page for the function `kmeans()`:
help(kmeans)

# View the data.frame `listings`:
str(listings, list.len=5)
## 'data.frame':    3585 obs. of  95 variables:
##  $ id                              : int  12147973 3075044 6976 1436513 7651065 12386020 5706985 2843445 753446 849408 ...
##  $ listing_url                     : Factor w/ 3585 levels "https://www.airbnb.com/rooms/10004575",..: 462 1876 2737 1261 2918 543 2402 1830 2890 3196 ...
##  $ scrape_id                       : num  2.02e+13 2.02e+13 2.02e+13 2.02e+13 2.02e+13 ...
##  $ last_scraped                    : Factor w/ 1 level "2016-09-07": 1 1 1 1 1 1 1 1 1 1 ...
##  $ name                            : Factor w/ 3504 levels " Cozy Spot in Boston's Little Italy",..: 3188 1052 2215 3041 1179 2544 2338 7 344 2489 ...
##   [list output truncated]

Let’s create a new data.frame listings_numeric which has the subset of columns that we wish to cluster on. For the kmeans() function, all of these columns must be numeric.

listings_numeric = listings %>% select(id,latitude,longitude,
                                       accommodates, bathrooms,
                                       bedrooms, review_scores_rating,
                                       price) %>%
  na.omit()
str(listings_numeric)
## 'data.frame':    2753 obs. of  8 variables:
##  $ id                  : int  3075044 6976 1436513 7651065 12386020 5706985 2843445 753446 849408 12023024 ...
##  $ latitude            : num  42.3 42.3 42.3 42.3 42.3 ...
##  $ longitude           : num  -71.1 -71.1 -71.1 -71.1 -71.1 ...
##  $ accommodates        : int  2 2 4 2 2 3 2 2 5 2 ...
##  $ bathrooms           : num  1 1 1 1.5 1 1 2 1 1 1 ...
##  $ bedrooms            : int  1 1 1 1 1 1 1 1 2 1 ...
##  $ review_scores_rating: int  94 98 100 99 100 90 96 96 94 80 ...
##  $ price               : num  65 65 75 79 75 100 75 58 229 60 ...
##  - attr(*, "na.action")=Class 'omit'  Named int [1:832] 1 19 33 37 54 55 57 66 72 74 ...
##   .. ..- attr(*, "names")= chr [1:832] "1" "19" "33" "37" ...

Next, run the k-means algorithm on the numeric data.frame, with k = 5 cluster centroids:

# k-means clustering with k = 5, excluding the id, latitude,
# and longitude columns (columns 1-3) from the features
set.seed(1234)
kmeans_clust = kmeans(listings_numeric[,-1:-3], 5, iter.max=1000, nstart=100)
kmeans_clust

Look at the average values within the clusters. What are the characteristics of these 5 groups? How many listings are in each cluster?

kmeans_clust$centers
##   accommodates bathrooms  bedrooms review_scores_rating     price
## 1     2.105873  1.160050 0.9875931             91.00331  80.53515
## 2     3.197987  1.121924 1.1174497             92.54586 170.09284
## 3     4.289109  1.306931 1.7287129             92.82574 274.25941
## 4     6.468750  2.328125 3.0312500             96.53125 712.96875
## 5     5.876106  1.823009 2.5044248             92.34513 426.60177
table(kmeans_clust$cluster)
## 
##    1    2    3    4    5 
## 1209  894  505   32  113

Finally, let’s display the clusters geographically using the (latitude, longitude) data. First, use ggmap to obtain a map of Boston. Adapted from https://www.r-bloggers.com/drug-crime-density-in-boston/

# install.packages("ggmap")
# devtools::install_github("dkahle/ggmap")
library(ggmap)
# requires internet connection
bos_plot=ggmap(get_map('Boston, Massachusetts',zoom=13,source='google',maptype='terrain'))
## Map from URL : http://maps.googleapis.com/maps/api/staticmap?center=Boston,+Massachusetts&zoom=13&size=640x640&scale=2&maptype=terrain&language=en-EN&sensor=false
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?address=Boston,%20Massachusetts&sensor=false
bos_plot

To get good color schemes, use RColorBrewer.

# install.packages("RColorbrewer")
library(RColorBrewer)
display.brewer.all()

Plot maps and Airbnb locations using the ggplot syntax.

bos_plot +
  geom_point(data=listings_numeric,aes(x=longitude,y=latitude,colour=factor(kmeans_clust$cluster)),
             alpha=.5,size=1) +
  xlab("Longitude") + ylab("Latitude") +
  scale_colour_brewer("Cluster", palette = "Set1")

Can you see where the clusters are? Also, what is the proper number of clusters? We will revisit this in the next session, because it requires some more advanced tidyverse tools. Stay tuned!
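As a preview, one common heuristic is the "elbow" plot: run kmeans() over a range of k and look for where the total within-cluster sum of squares stops dropping sharply. A sketch on synthetic data (not the listings):

```r
# Elbow heuristic on synthetic 2-D data with three true clusters
set.seed(1)
pts = rbind(matrix(rnorm(100, mean = 0),  ncol = 2),
            matrix(rnorm(100, mean = 5),  ncol = 2),
            matrix(rnorm(100, mean = 10), ncol = 2))
wss = sapply(1:6, function(k) kmeans(pts, k, nstart = 20)$tot.withinss)
plot(1:6, wss, type = "b", xlab = "Number of clusters k",
     ylab = "Total within-cluster sum of squares")
# Here the curve flattens sharply after k = 3, the true number of clusters
```

The same loop applied to listings_numeric would give a first guess at a reasonable k for the Airbnb data.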

In this module, we have covered examples of machine learning methods for linear regression, LASSO, CART, and k-means. This is just the tip of the iceberg. There are tons more machine learning methods which can be easily implemented in R. We provide some bonus R code for random forest, regularized logistic regression, and SVM applied to the Airbnb data set in the file bonus.R.